Data-driven mapping refers to the process of using data values to determine the symbology of mapped features. Color, shape, and size are the three most common graphic elements used to symbolize data-driven maps. Data-driven maps are often referred to as thematic maps.
There are two primary types of thematic maps:
Choropleth maps: set the color of areas (polygons) by data value
Point symbol maps: set the color or size of points by data value
We review both of these types of maps in more detail in this lesson. First, let’s take a quick look at choropleth maps.
library(sf)
library(tmap)
Choropleth maps are the most common type of thematic map.
Let’s use an sf data.frame of counties data to make a choropleth map.
First, read in the counties data with the st_read function.
counties = st_read('notebook_data/california_counties/CaliforniaCounties.shp')
## Reading layer `CaliforniaCounties' from data source `/Users/pattyf/Documents/Dlab/workshops/2021/Geospatial-Fundamentals-in-R-with-sf/notebook_data/california_counties/CaliforniaCounties.shp' using driver `ESRI Shapefile'
## Simple feature collection with 58 features and 23 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -374445.4 ymin: -604500.7 xmax: 540038.5 ymax: 450022
## Projected CRS: NAD83 / California Albers
Then, make a map of our counties.
plot(counties$geometry)
Now, take a look at the spatial dataframe.
head(counties)
## Simple feature collection with 6 features and 23 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -267387.9 ymin: -578013.2 xmax: 216677.6 ymax: 352693.6
## Projected CRS: NAD83 / California Albers
## NAME STATE_NAME POP2012 POP12_SQMI WHITE BLACK AMERI_ES ASIAN
## 1 Kern California 851089 104.282870 499766 48921 12676 34846
## 2 Kings California 155039 111.427421 83027 11014 2562 5620
## 3 Lake California 65253 49.082334 52033 1232 2049 724
## 4 Lassen California 35039 7.422856 25532 2834 1234 356
## 5 Los Angeles California 9904341 2423.264150 4936599 856874 72828 1346865
## 6 Madera California 153025 71.065672 94456 5629 4136 2802
## HAWN_PI HISPANIC OTHER MULT_RACE MALES FEMALES MED_AGE HOUSEHOLDS
## 1 1252 413033 204314 37856 433108 406523 30.7 254610
## 2 271 77866 42996 7492 86344 66638 31.1 41233
## 3 108 11088 5455 3064 32469 32196 45.0 26548
## 4 165 6117 3562 1212 22416 12479 37.0 10058
## 5 26094 4687889 2140632 438713 4839654 4978951 34.8 3241204
## 6 162 80992 37380 6300 72682 78183 33.1 43317
## FAMILIES HSE_UNITS AVE_FAM_SZ VACANT OWNER_OCC RENTER_OCC CountyFIPS
## 1 191739 284367 3.61 29757 152828 101782 06103
## 2 31939 43867 3.59 2634 22329 18904 06089
## 3 16255 35492 2.94 8944 17472 9076 06106
## 4 6800 12710 2.98 2652 6590 3468 06086
## 5 2194080 3445076 3.58 203872 1544749 1696455 06073
## 6 34093 49140 3.63 5823 27726 15591 06102
## geometry
## 1 MULTIPOLYGON (((213672.6 -2...
## 2 MULTIPOLYGON (((12524.03 -1...
## 3 MULTIPOLYGON (((-235734.3 1...
## 4 MULTIPOLYGON (((12.28914 35...
## 5 MULTIPOLYGON (((173874.5 -4...
## 6 MULTIPOLYGON (((16681.16 -1...
In particular, we are interested in the columns with numeric values, as these are the ones typically used to make data-driven maps.
To get started, let’s create a choropleth map by setting the color of each county based on the value in the population per square mile column (POP12_SQMI).
Recall that sf’s plot method does this by default! So, here’s the quickest way to make a choropleth:
plot(counties['POP12_SQMI'])
By default, sf::plot linearly scales the colors to the data values. This is called a proportional color map.
A proportional color map will have a legend with a continuous color ramp rather than discrete data ranges.
A key benefit of a proportional color map is that it depicts the full range of data values without imposing any groupings.
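To make linear scaling concrete, here is a minimal base R sketch (not what sf runs internally). It rescales the six density values from the head() output above to the 0-1 range and looks up each one on a color ramp. The ramp colors and the variable names dens, scaled, and cols are illustrative assumptions.

```r
# Population densities for the six counties shown in the head() output above
dens <- c(Kern = 104.282870, Kings = 111.427421, Lake = 49.082334,
          Lassen = 7.422856, `Los Angeles` = 2423.264150, Madera = 71.065672)

# Linearly rescale the values so the minimum is 0 and the maximum is 1
scaled <- (dens - min(dens)) / (max(dens) - min(dens))

# Build a continuous ramp (colors chosen for illustration only)
ramp <- colorRamp(c("lightyellow", "darkred"))

# Look up one color per county; ramp() returns RGB values on a 0-255 scale
cols <- rgb(ramp(scaled), maxColorValue = 255)

data.frame(density = dens, scaled = round(scaled, 3), color = cols)
```

Because Los Angeles is so much denser than the other five counties, those five all land in the bottom 5% of the ramp and receive nearly the same color.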
We can also use tmap to create thematic maps. This package gives us greater control over the visualization details.
In tmap, instead of setting the col argument to the same static value (e.g. ‘red’, ‘#ef03a5’) for all features, we can set it to the name of the column by which we want our polygons colored (e.g. ‘POP12_SQMI’).
# Set the mapping mode to a static plot (not interactive)
tmap_mode('plot')
## tmap mode set to plotting
# Map the county polygons colored by the values in the POP12_SQMI column
tm_shape(counties) +
tm_polygons(col='POP12_SQMI',
title = "Population Density per mi^2")
By default, tmap uses a yellow-orange-brown (YlOrBr) sequential color palette for thematic maps and bins those colors into 3 to 7 classes of approximately equal intervals with rounded values for class breaks.
Of course, we can also use tmap’s interactive mapping mode. Do you recall the syntax for:
setting the tmap mode to static vs interactive mapping?
or toggling between these two modes?
Let’s make an interactive map, making our layer partially transparent, i.e. alpha = 0.5, so that we can see the basemap through our polygons.
tmap_mode('view')
## tmap mode set to interactive viewing
tm_shape(counties) +
tm_polygons(col='POP12_SQMI', alpha=0.5,
title = "Population Density per mi^2")
That’s really the heart of creating a choropleth map with tmap. To set the color of the features based on the values in a column, set the col argument to the column name in the sf data.frame (quoted as a string!).
Redo the map above, but mapping population (POP2012) NOT population density.
# Map of County Population
Question
Which map better conveys county population: POP12_SQMI or POP2012?
The goal of a thematic map is to use color to visualize the spatial distribution of a variable.
Another goal is to use color to effectively and quickly convey information. For example,
maps use brighter or richer colors to signify higher values,
and leverage cognitive associations such as mapping water with the color blue.
There are two major challenges when creating thematic maps:
Our eyes are drawn to the color of larger areas or linear features, even if the values of smaller features are more significant.
The range of data values is rarely evenly distributed across all observations and thus the colors can be misleading.
Questions
Do you see either of these problems in our population-density map?
hist(counties$POP12_SQMI,breaks=40, main = 'Population Density per mi^2')
There are three main techniques for dealing with these mapping challenges:
Color palettes
Data transformations
Classification schemes
There are three main types of color palettes (or color maps), each of which has a different purpose:
diverging - a “diverging” set of colors is used to emphasize mid-range values as well as extremes.
sequential - usually with a single or multi color hue to emphasize differences in order and magnitude, where darker colors typically mean higher values
qualitative - a contrasting set of colors to identify distinct categories and avoid implying quantitative significance.
Tip: Sites like ColorBrewer let you play around with different types of color maps.
To see the names of all color palettes available to tmap, try the following command. You may need to enlarge the output image.
RColorBrewer::display.brewer.all()
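As a quick check on the three palette types, the sketch below pulls one example of each from RColorBrewer. The choice of YlOrBr, RdBu, and Set2 as representatives is an assumption made for illustration; any palette from the same category would do.

```r
library(RColorBrewer)

# One five-color palette from each category
seq_cols  <- brewer.pal(5, "YlOrBr")  # sequential: light to dark
div_cols  <- brewer.pal(5, "RdBu")    # diverging: two hues around a midpoint
qual_cols <- brewer.pal(5, "Set2")    # qualitative: unordered, contrasting hues

# brewer.pal.info records the category of every palette
# ("seq", "div", or "qual") along with its maximum number of colors
brewer.pal.info[c("YlOrBr", "RdBu", "Set2"), ]
```

Checking the category column before choosing a palette is a simple way to avoid pairing a qualitative palette with quantitative data.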
As a best practice, a qualitative color palette should not be used with quantitative data and vice versa. For example, consider this map that EDM.com published of top dance tracks by state.
For a number of reasons, data are often distributed in aggregated form. For example, the Census Bureau collects data from individual people, households and businesses and distributes it aggregated to states, counties, and census tracts, etc.
When the aggregated data are counts, like total population, they can be transformed to densities, proportions and ratios. These normalized variables are more comparable across regions that differ greatly in size.
Let’s consider this in terms of our data.
The basic cartographic rule is that when mapping areas that differ in size, you should not map counts, since those differences in size make the comparison less valid.
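A small base R sketch can show why this matters. It uses the six rows from the head() output shown earlier and compares how the counties rank by count versus by density. The data frame name counties_tbl and the derived columns are assumptions introduced only for this example.

```r
# Values copied from the first six rows of the counties data
counties_tbl <- data.frame(
  NAME       = c("Kern", "Kings", "Lake", "Lassen", "Los Angeles", "Madera"),
  POP2012    = c(851089, 155039, 65253, 35039, 9904341, 153025),
  POP12_SQMI = c(104.282870, 111.427421, 49.082334,
                 7.422856, 2423.264150, 71.065672)
)

# Area implied by the two columns (population divided by density)
counties_tbl$sqmi <- counties_tbl$POP2012 / counties_tbl$POP12_SQMI

# Rank 1 = largest value
counties_tbl$rank_count   <- rank(-counties_tbl$POP2012)
counties_tbl$rank_density <- rank(-counties_tbl$POP12_SQMI)

counties_tbl[, c("NAME", "sqmi", "rank_count", "rank_density")]
```

Kern has far more people than Kings, but it is also much larger, so Kings is the denser of the two. A map of counts and a map of densities would order these counties differently.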
Another way to make more meaningful maps is to improve the way in which data values are mapped to colors.
The common alternative to a proportional color map is to use a classification scheme to create a graduated color map. This is the standard way to create a choropleth map.
A classification scheme is a method for binning continuous data values into 4-7 classes (the default is 5) and mapping those classes to a color palette.
Classification schemes can be implemented using the tmap geometry functions (tm_polygons, tm_dots, etc.) by setting a value for the style argument.
Here are some of the tmap keyword names for classification styles that we can use (from the docs: ?tm_polygons):
equal, quantile, fisher, jenks, headtails, fixed, kmeans, pretty.
For more information about these classification schemes see ?classIntervals or sources such as this page in the Lovelace, Nowosad, and Muenchow ebook.
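To get a feel for how two of these styles differ, here is a rough base R sketch of equal-interval and quantile binning. It is not tmap's own implementation, and the vector x is made-up, deliberately skewed data.

```r
# Made-up skewed values (one large outlier), for illustration only
x <- c(2, 5, 7, 12, 20, 35, 50, 71, 104, 111, 250, 2423)
n_classes <- 4

# Equal intervals: split the data range into classes of equal width
equal_breaks <- seq(min(x), max(x), length.out = n_classes + 1)

# Quantiles: choose breaks so each class holds about the same number of values
quant_breaks <- quantile(x, probs = seq(0, 1, length.out = n_classes + 1))

equal_class <- cut(x, breaks = equal_breaks, include.lowest = TRUE)
quant_class <- cut(x, breaks = quant_breaks, include.lowest = TRUE)

table(equal_class)  # how many values fall in each equal-interval class
table(quant_class)  # how many values fall in each quantile class
```

With equal intervals, 11 of the 12 values fall in the first class, so nearly every feature would get the same color. Quantile breaks put three values in each class, which spreads the colors more evenly across features.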
Let’s redo the last map using the quantile classification scheme.
tmap_mode('plot')
## tmap mode set to plotting
# Plot population density - mile^2
tm_shape(counties) +
tm_polygons(col = 'POP12_SQMI',
style="quantile",
alpha=0.5,
title="Population Density per mi^2")
Redo the previous map with these classification schemes: headtails, equal, jenks
You may get pretty close to your final map without being completely satisfied. In this case you can manually define a classification scheme.
Let’s customize our map with a user-defined classification scheme where we manually set the breaks for the bins using the breaks argument (with style set to ‘fixed’).
tm_shape(counties) +
tm_polygons(col = 'POP12_SQMI',
palette = "YlGn",
style='fixed',
breaks = c(0, 50, 100, 200, 300, 400, max(counties$POP12_SQMI)),
title = "Population Density per Sq Mile")
Since we are customizing our plot, we can also edit our legend to specify the text, so that it’s easier to read.
We can use tm_add_legend to build our own customized legend.
tm_shape(counties) +
tm_polygons(col = 'POP12_SQMI',
palette = "YlGn",
style='fixed',
breaks = c(0, 50, 100, 200, 300, 400, max(counties$POP12_SQMI)),
legend.show = F) +
tm_add_legend('fill', col = RColorBrewer::brewer.pal(6, "YlGn"),
border.col = "black",
title = "Population Density per Sq Mile",
labels = c('<50','50 to 100','100 to 200','200 to 300','300 to 400','>400'))
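When breaks, colors, and labels are all set by hand, they have to agree with each other. The base R sketch below checks that relationship with cut, using the same break points and labels as the map above and the six density values from the earlier head() output. Treating the bins as left-closed (right = FALSE) is an assumption chosen to match the legend labels.

```r
# Densities for the six counties shown in the head() output
dens <- c(Kern = 104.282870, Kings = 111.427421, Lake = 49.082334,
          Lassen = 7.422856, `Los Angeles` = 2423.264150, Madera = 71.065672)

# Same break points as the map; the upper bound here is the largest of the
# six sample values rather than max(counties$POP12_SQMI)
breaks <- c(0, 50, 100, 200, 300, 400, max(dens))
labels <- c('<50', '50 to 100', '100 to 200', '200 to 300', '300 to 400', '>400')

# n + 1 break points define n bins, so we need exactly n labels (and n colors)
n_bins <- length(breaks) - 1
stopifnot(length(labels) == n_bins)

# Assign each value to a left-closed bin such as [50, 100)
bins <- cut(dens, breaks = breaks, labels = labels,
            right = FALSE, include.lowest = TRUE)

data.frame(density = dens, bin = bins)
```

Seven break points give six bins, which is why the legend code asks brewer.pal for six colors and supplies six labels.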
If we look at the columns in our dataset, we see we have a number of variables from which we can calculate proportions, rates, and the like.
Let’s try that out:
head(counties)
## Simple feature collection with 6 features and 23 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -267387.9 ymin: -578013.2 xmax: 216677.6 ymax: 352693.6
## Projected CRS: NAD83 / California Albers
## NAME STATE_NAME POP2012 POP12_SQMI WHITE BLACK AMERI_ES ASIAN
## 1 Kern California 851089 104.282870 499766 48921 12676 34846
## 2 Kings California 155039 111.427421 83027 11014 2562 5620
## 3 Lake California 65253 49.082334 52033 1232 2049 724
## 4 Lassen California 35039 7.422856 25532 2834 1234 356
## 5 Los Angeles California 9904341 2423.264150 4936599 856874 72828 1346865
## 6 Madera California 153025 71.065672 94456 5629 4136 2802
## HAWN_PI HISPANIC OTHER MULT_RACE MALES FEMALES MED_AGE HOUSEHOLDS
## 1 1252 413033 204314 37856 433108 406523 30.7 254610
## 2 271 77866 42996 7492 86344 66638 31.1 41233
## 3 108 11088 5455 3064 32469 32196 45.0 26548
## 4 165 6117 3562 1212 22416 12479 37.0 10058
## 5 26094 4687889 2140632 438713 4839654 4978951 34.8 3241204
## 6 162 80992 37380 6300 72682 78183 33.1 43317
## FAMILIES HSE_UNITS AVE_FAM_SZ VACANT OWNER_OCC RENTER_OCC CountyFIPS
## 1 191739 284367 3.61 29757 152828 101782 06103
## 2 31939 43867 3.59 2634 22329 18904 06089
## 3 16255 35492 2.94 8944 17472 9076 06106
## 4 6800 12710 2.98 2652 6590 3468 06086
## 5 2194080 3445076 3.58 203872 1544749 1696455 06073
## 6 34093 49140 3.63 5823 27726 15591 06102
## geometry
## 1 MULTIPOLYGON (((213672.6 -2...
## 2 MULTIPOLYGON (((12524.03 -1...
## 3 MULTIPOLYGON (((-235734.3 1...
## 4 MULTIPOLYGON (((12.28914 35...
## 5 MULTIPOLYGON (((173874.5 -4...
## 6 MULTIPOLYGON (((16681.16 -1...
Let’s calculate the percent of the population that is Hispanic and save it to a new column. Then, we can use that to create a choropleth map.
# calculate percent hispanic as a new column
counties$pct_hispanic = counties$HISPANIC/counties$POP2012 * 100
# Plot percent hispanic as choropleth
tm_shape(counties) +
tm_polygons(col = 'pct_hispanic',
palette = 'Blues',
style = 'fixed',
breaks= c(0,20,40,60,80,100),
border.col = "darkgrey",
lwd = 1.5,
legend.show=F) +
tm_add_legend('fill', col = RColorBrewer::brewer.pal(5, "Blues"),
border.col = "darkgrey",
title = "Percent Hispanic Population",
labels = c('<20%','20% - 40%','40% - 60%','60% - 80%','80% - 100%'))
Question
What new options and operations have we added to our code?
How many values do we specify in the breaks vector, and how many bins are in the map legend? Why?
Choropleth maps are great, but point maps enable us to visualize our spatial data in another way.
If you know both mapping methods you can expand how much information you can show in one map.
For example, point maps are a great way to map counts because the varying sizes of areas are deemphasized.
The tm_dots element makes it easy to create point maps dynamically from polygon data!
# County population counts as a point map!
tmap_mode('plot')
## tmap mode set to plotting
# Add the county polygon borders as a basemap
tm_shape(counties) +
tm_borders(col="grey") +
# Then map the county centroids as points colored by population counts
tm_shape(counties) +
tm_dots(col = 'POP2012',
palette = 'YlOrRd',
style = 'jenks',
border.col = "black", # dot borders only visible in interactive mode!
border.lwd = 1,
border.alpha=1,
size=.5,
legend.show=T)
Converting polygons to points in this way is another useful type of data transformation for making effective maps.
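If you want the points as their own sf object rather than leaving the conversion to the mapping function, one option is sf's st_centroid. This is a sketch of an alternative under that assumption, not a description of what tmap does internally. The square polygon and its attribute values are made up.

```r
library(sf)

# A made-up 2 x 2 square polygon standing in for a county
sq <- st_polygon(list(rbind(c(0, 0), c(2, 0), c(2, 2), c(0, 2), c(0, 0))))

# Wrap it in an sf data.frame; EPSG:3310 is NAD83 / California Albers.
# agr = "constant" tells sf the attributes apply to the whole geometry,
# which avoids a warning when the geometry is replaced by its centroid
polys <- st_sf(name = "A", pop = 100,
               geometry = st_sfc(sq, crs = 3310),
               agr = "constant")

# Replace each polygon with its centroid; attribute columns are kept
pts <- st_centroid(polys)

st_geometry_type(pts)  # POINT
st_coordinates(pts)    # centroid of the square
```

The resulting point layer keeps the pop column, so it can be passed to a point-mapping function in the same way as the schools data later in this lesson.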
Let’s read in some data that is more typically encoded with point geometry - Alameda County schools.
schools_df = read.csv('notebook_data/alco_schools.csv')
head(schools_df)
## X Y Site Address City
## 1 -122.2388 37.74476 Amelia Earhart Elementary 400 Packet Landing Rd Alameda
## 2 -122.2519 37.73900 Bay Farm Elementary 200 Aughinbaugh Way Alameda
## 3 -122.2589 37.76206 Donald D. Lum Elementary 1801 Sandcreek Way Alameda
## 4 -122.2348 37.76525 Edison Elementary 2700 Buena Vista Ave Alameda
## 5 -122.2381 37.75396 Frank Otis Elementary 3010 Fillmore St Alameda
## 6 -122.2616 37.76911 Franklin Elementary 1433 San Antonio Ave Alameda
## State Type API Org
## 1 CA ES 933 Public
## 2 CA ES 932 Public
## 3 CA ES 853 Public
## 4 CA ES 927 Public
## 5 CA ES 894 Public
## 6 CA ES 893 Public
We read this data from a plain CSV file, so let’s convert it to an sf data.frame.
schools_sf = st_as_sf(schools_df,
coords = c('X','Y'),
crs = 4326)
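The sketch below repeats the same conversion on a two-row data frame typed in by hand, so you can see what st_as_sf changes without needing the CSV file. The rows are copied from the head() output above; the object names schools and schools_pts are assumptions for this example.

```r
library(sf)

# Two rows copied from the schools table, with a subset of the columns
schools <- data.frame(
  X    = c(-122.2388, -122.2519),
  Y    = c(37.74476, 37.73900),
  Site = c("Amelia Earhart Elementary", "Bay Farm Elementary"),
  API  = c(933, 932)
)

# X and Y become point geometry; crs = 4326 declares them as WGS84 lon/lat
schools_pts <- st_as_sf(schools, coords = c("X", "Y"), crs = 4326)

names(schools_pts)        # X and Y are gone, a geometry column is added
st_crs(schools_pts)$epsg  # the CRS we declared
```

Note that crs = 4326 only declares what the coordinates already are. It does not reproject them.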
Then we can map it.
plot(schools_sf)
The display above, with one map per column in the dataframe, is useful because at a glance you can see the type of each data variable and get a sense of its range of values.
The default sf::plot point map for a numeric data column is a proportional color map that linearly scales the color of the point symbol by the data values.
# Point map of API - Academic Performance Index
plot(schools_sf['API'])
Let’s try creating the same map with tmap.
tmap_mode('plot')
## tmap mode set to plotting
tm_shape(schools_sf) +
tm_dots(col="API")
The basic tmap graduated color map needs some customization to shine, especially in plot mode!
By default, tmap uses a yellow-orange-brown (YlOrBr) sequential color palette and the pretty classification scheme for point thematic maps. These are the same defaults that are used for tmap choropleth maps. But point maps that symbolize data values by color are called Graduated Color Maps. In spite of the different map names, the color and classification scheme options are almost identical in tmap! However, some options will be different - for example, a size parameter makes sense for a point radius but not a polygon!
See ?tm_dots for more information about the options for customizing point maps! For example…
# API Graduated Color Map
tm_shape(schools_sf) +
tm_dots(col='API',
size=0.15,
palette='Reds',
style='fixed',
breaks=c(0, 200, 400, 600, 800, 1000),
border.col='grey',
legend.show=F) +
tm_add_legend('fill', title='Alameda County, school API scores',
labels = c('<200', '[200,400)', '[400,600)', '[600,800)', '>800'),
col = RColorBrewer::brewer.pal(5, "Reds")) +
tm_layout(legend.position = c('right','top'))
Another important type of point map is the proportional symbol map. These are like proportional color maps but instead of associating symbol color with data values they associate symbol size. You can make these in tmap with the tm_bubbles function.
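A common cartographic convention, assumed here rather than taken from the tmap documentation, is to make the area of each symbol proportional to the data value, which means the radius grows with the square root of the value. The base R sketch below works through that arithmetic with made-up values.

```r
# Made-up data values and an arbitrary radius for the largest symbol
values     <- c(25, 100, 400)
max_radius <- 10

# Radius proportional to the square root of the value ...
radius <- max_radius * sqrt(values / max(values))

# ... so that circle area is proportional to the value itself
area <- pi * radius^2

data.frame(value = values, radius = radius, area_ratio = area / area[1])
```

The largest value is 16 times the smallest, and its circle has 16 times the area but only 4 times the radius. Scaling the radius directly would exaggerate large values.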
The schools data does not contain any good variables for proportional symbol mapping so we will read in a supplemental file of NCES data and join it to the school points.
df = read.csv('notebook_data/other/PolicyMap_NCES_Data_20210429.csv')
#head(df,2)
df2 = df[c('School.Name','Student.Teacher.Ratio','Free.and.Reduced.price.Lunch.Eligible.Students')]
colnames(df2) <- c('Site','STRatio','RLunch')
#head(df2,2)
schools_sf2 <- merge(schools_sf, df2, by="Site")
#head(schools_sf2,2)
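One thing to keep in mind with merge is that it performs an inner join by default, so rows without a matching key in both tables are dropped. The base R sketch below shows this with plain data frames. The site names and values are made up, and all.x = TRUE is shown as one option, not as a correction to the join above.

```r
# Made-up tables that share a Site key
schools <- data.frame(Site = c("A", "B", "C"), API = c(933, 932, 853))
nces    <- data.frame(Site = c("A", "C", "D"), STRatio = c(20.5, 18.0, 22.1))

# Default: inner join, keeps only Sites present in both tables
inner <- merge(schools, nces, by = "Site")

# all.x = TRUE: keep every school, filling missing NCES values with NA
left <- merge(schools, nces, by = "Site", all.x = TRUE)

nrow(inner)  # schools matched in both tables
left         # school B is kept with STRatio = NA
```

Comparing the row counts before and after a join is a quick way to see how many schools failed to match by name.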
tmap_mode('plot')
## tmap mode set to plotting
tm_shape(schools_sf2) +
tm_bubbles(size="RLunch",
col="pink",
border.col='black',
title.size="Students Eligible for Free/Reduced Lunch" ) +
tm_layout( legend.position = c('right','top'))
Mapping categorical data, also called qualitative data, is a bit more straightforward. There is no need to scale or classify data values. The goal of the color map is to provide a contrasting set of colors so as to clearly delineate different categories. Here’s a point-based example:
tm_shape(schools_sf) +
tm_dots(col='Org', size=0.15, palette='Spectral', title="School Type")
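We learned about important data-driven mapping strategies and mapping concepts, including choropleth maps, point symbol maps, color palettes, data transformations, and classification schemes.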
Points and polygons are not the only geometry types that we can use in data-driven mapping! You can also map linear features by associating data values with the color, shape and size of features. But these types of maps are less common.